Design of Improved Web Crawler By Analysing Irrelevant Result
Authors
Abstract
A key issue in designing a focused Web crawler is how to determine whether an unvisited URL is relevant to the search topic. Effective relevance prediction can help avoid downloading and visiting many irrelevant pages. In this paper, we propose a new learning-based approach to improve relevance prediction in focused Web crawlers. For this study, we chose Naïve Bayes as the base prediction model, though it can easily be swapped for a different prediction model. The performance of a focused crawler depends mostly on the richness of links within the specific topic being searched, and focused crawling usually relies on a general Web search engine to provide starting points.

Key Terms: URL; focused crawler; classifier; relevance prediction; links; search engine; ranking

Full Text: http://www.ijcsmc.com/docs/papers/August2013/V2I8201356.pdf
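To make the approach concrete, below is a minimal sketch of Naïve Bayes relevance prediction for unvisited URLs, scoring each candidate by the words in its anchor text. The training data, topic, and threshold are illustrative assumptions, not details from the paper; a real crawler would train on features of pages it has already judged relevant or irrelevant.

```python
import math
from collections import Counter

# Hypothetical toy training set: anchor texts labelled 1 (relevant to an
# assumed topic, "machine learning") or 0 (irrelevant). Purely illustrative.
TRAIN = [
    ("machine learning tutorial", 1),
    ("deep learning course notes", 1),
    ("neural network basics", 1),
    ("buy cheap shoes online", 0),
    ("celebrity gossip news", 0),
    ("site map and contact page", 0),
]

def train_nb(samples):
    """Estimate class priors and per-class word counts for Naive Bayes."""
    class_counts = Counter(label for _, label in samples)
    word_counts = {0: Counter(), 1: Counter()}
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_relevance(text, model):
    """Return P(relevant | anchor text) with Laplace-smoothed likelihoods."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_post = {}
    for label in (0, 1):
        logp = math.log(class_counts[label] / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            logp += math.log((word_counts[label][w] + 1) / denom)
        log_post[label] = logp
    # Normalise the two log posteriors into a probability of relevance.
    m = max(log_post.values())
    exp = {k: math.exp(v - m) for k, v in log_post.items()}
    return exp[1] / (exp[0] + exp[1])

model = train_nb(TRAIN)
score = predict_relevance("free machine learning course", model)
# A focused crawler would enqueue the URL only if the score clears a threshold.
print(score > 0.5)  # → True
```

The same interface applies whatever classifier sits behind `predict_relevance`, which is the switchability the abstract alludes to: only the training and scoring internals change.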
Similar Papers
A Tool for Link-Based Web Page Classification
Virtual integration systems require a crawler to navigate through web sites automatically, looking for relevant information. This process is online, so whilst the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory to improve the crawler efficiency. Most crawlers need to download a page to d...
Full Text

An Effective Focused Web Crawler for Web Resource Discovery
Given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Web crawling is the process search engines use to collect pages from the Web, so collecting domain-specific information from the Web is a recurring research theme. In this paper, we introduce a new, effective focused web crawler. It uses smart methods to ...
Full Text

A Tool for Web Links Prototyping
Crawlers for Virtual Integration processes must be efficient, given that the VI process is online: while the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory in order to improve the crawler's efficiency. Most crawlers need to download a page in order to determine its relevance...
Full Text

Reinforcement-Based Web Crawler
This paper presents a focused web crawler system that automatically creates a minority-language corpus. The system uses a database of relevant and irrelevant documents to test the relevance of retrieved web documents, and requires a starting web document to indicate where the search should begin.
Full Text

Intelligent Web Navigation?
Virtual integration systems retrieve information from several web applications according to a user’s interests. Being an online process, response time is a significant factor. Usually web pages contain a high number of links, some of them leading to interesting information, but most of them having other purposes, like advertising or internal site navigation. Traditional crawlers follow every li...
Full Text